The Compilation Process
Compilation involves five main steps
Entire code analysed at each step
Compiler can look at every line of code with surrounding context
Efficient and optimised machine code (object code) can be generated
Lexical Analysis
Syntax Analysis
Semantic Analysis
Code Generation
Code Optimisation
Interpreters use the first three steps to find the next instruction
- Cannot do any optimisations because they do not 'look ahead' at the whole code
- Might encounter an error in the code while program is running
Java compiler uses all five steps to produce optimised bytecode that is then interpreted by the JVM
Lexical Analysis
Lexical analysis is also known as scanning
The source code is just a text file containing a long string of characters
- The scanner uses whitespace and punctuation to identify sequences of characters
- Often implemented as a finite state machine
- Some scanners use backtracking to look at surrounding context
- This turns the code into a series of lexemes (or tokens)
After this step, the compiler will have a list of the tokens that make up the program - Identifiers (variable, parameter, and method names chosen by the programmer)
- Keywords (that have some pre-defined meaning in the language)
- Separators (such as brackets and punctuation)
- Operators (mathematical symbols that can be evaluated)
- Literals (such as strings, constants, numbers)
- Comments (stripped out at this stage (no longer needed))
Example
total = (a + b) * vat;
The scanner will split this into a list of tokens based on whitespace and punctuation
IDENTIFIER total
EQUALS
OPEN_PAREN
IDENTIFIER a
PLUS
IDENTIFIER b
CLOSE_PAREN
MULTIPLY
IDENTIFIER vat
SEMICOLON
The source file will be turned into a long list of tokens that are passed to the next step
Syntax Analysis
Syntax analysis is also known as parsing
The list of tokens is checked to see if it forms a valid program
- The parser knows all the grammar rules for the language
- It checks the sequence of tokens to see which rules they match
- Often implemented as a recursive algorithm with backtracking
- This generates an abstract syntax tree (AST)
Syntax is correct if the entire list of tokens can be matched to a sequence of valid rules
If one or more tokens cannot be matched to a set of rules, there must be a syntax error, so the compiler stops and outputs a message
The AST arranges tokens into a structure that represents different parts of the code
- Variable declarations
- Conditions
- Loops
- Functions
- Parameters
- etc.
This AST is passed to the next step
Example
int modulo(int x, int y) {
int result = (x % y);
return result
}
![[Pasted image 20250519093813.png]]
This is the modulo function's AST
Semantic Analysis
Semantic analysis adds meaning to the program
- Identifiers are added to a symbol table
- Performs type checking (in strongly typed languages (eg. Java))
- Variable usage and function calls are matched to their definitions (object binding)
- Memory requirements will be determined for each variable, parameter, etc.
- Nodes of the syntax tree will be augmented with this semantic information
The compiler will stop and output a message if semantic checks fail - Type mismatch (variables and parameters)
- Using undeclared variables
- Using a reserved keyword as an identifier
- Multiple declarations of the same identifier
- Using a variable that is out of scope
The augmented syntax tree and symbol table are passed to the next step
Code Generation
Code generation turns the AST into actual machine code instructions
- Adds instructions to reserve memory for each variable in the symbol table
- Walks through each node in the syntax tree and turns it into CPU instructions
- Walking node by node will generate code in the correct order
Code generator must: - Specify correct instructions (from CPU instruction set)
- Select appropriate CPU registers to hold variables during calculations
- Create a subroutine (with label) for each high level function in the syntax tree
- Add jump instructions to implement each condition and loop in the syntax tree
- Add call, push, pop instructions to set up subroutines with correct parameters
- Add debugging information (optionally) so programmer can step through code
The generated assembly code will be passed onto the next step
Most compilers generate machine code directly without going via assembly code first
![[Pasted image 20250519094431.png]]
Here's the generated assembly code for the modulo function
Code Optimisation
During the optimisation step the code is analysed to see if there are ways to
- Reduce amount of code
- Eliminate repeated operations
- Reorganise parts of the program to execute faster
There are lots of optimisation techniques
Optimisation can happen in multiple places during the compilation process
- Scanning
- Parsing
- Code generation
Code Optimisation Techniques
- Remove redundancy - Store calculated results so they can be used again later instead of performing the calculation again
- Remove code - Such as unnecessary calculations and intermediate variables
- Unroll loops - Generate the same instructions in a linear manner instead of jumping back in the code (in some cases this helps with CPI instruction pre-fetching)
- Reverse loops - So the counter works in the opposite direction (reduces number of instructions required)
- Increase locality - Put related code and data next to each other in memory to help with the principle of locality
- Use parallelism - Make use of multiple CPU cores by identifying and arranging instructions that can be carried out concurrently
- Fold constants - Replace calculations with their results instead of performing the calculation in the code (eg. replace 6+5 with 11)
Loop Invariant Code
Consider this fragment of code:
for(int i = 0; i <= n; i++) {
foo = amp + 5;
sum = sum + (foo * i);
}
The calculation of foo happens each time round the loop
- But the value of
ampis never changed inside the loop - So
foowill always be the same on each iteration - Wastes CPU time doing same calculation again and again
The code optimisation step will look for loop invariant code and rearrange it
foo = amp + 5;
for(int i=0; i<=n; i++) {
sum = sum + (foo*i);
}
Strength Reduction
Consider this fragment of code:
for(int i = 0; i<=n; i++) {
sum = sum + i;
}
The loop can be removed completely and replaced with a simple calculationsum = n * (1+n)) /2;
This does exactly the same thing (sums all integers from 1 to n)
- Uses fewer instructions and variables
- Eliminates the need to jump back in the machine code (to implement the loop)
Strength reduction replaces less efficient code with more efficient code where possible
Code optimisation can make the harder to understand
Linking Phase and Symbol Resolution
A large program will be split across multiple source files
- Each file is compiled separately into its own object file
- Source files often refer to each other or to libraries
- Symbols (variables, functions) defined in one file can be used in another
The linker is responsible for combining individual object files into the final executable - It can see the symbol table in each object file (and in each imported library)
- It will look for a symbol called main
- It will cross-check symbol tables to make sure every reference is valid
The linker will stop and output an error message if: - The code uses a symbol that isn't defined in any of the source files or libraries
- There is no symbol called main